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ABSTRACT 

Motivation: Metagenome analysis requires tools that can estimate the 
taxonomic abundances in anonymous sequence data over the whole 
range of biological entities. Because there is usually no prior know- 
ledge about the data composition, not only all domains of life but also 
viruses have to be included in taxonomic profiling. Such a full-range 
approach, however, is difficult to realize owing to the limited coverage 
of available reference data. In particular, archaea and viruses are gen- 
erally not well represented by current genome databases. 
Results: We introduce a novel approach to taxonomic profiling of 
metagenomes that is based on mixture model analysis of protein 
signatures. Our results on simulated and real data reveal the 
difficulties of the existing methods when measuring achaeal or viral 
abundances and show the overall good profiling performance of the 
protein-based mixture model. As an application example, we provide 
a large-scale analysis of data from the Human Microbiome Project. 
This demonstrates the utility of our method as a first instance profiling 
tool for a fast estimate of the community structure. 
Availability: http://gobics.de/TaxyPro. 
Contact: pmeinic@gwdg.de 

Supplementary information: Supplementary Material is available at 
Bioinformatics online. 
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1 INTRODUCTION 

Metagenomics has significantly enhanced the exploration of the 
biological diversity on our planet. Shotgun sequencing of envir- 
onmental DNA has provided a wealth of data for the analysis 
of the taxonomic and functional composition of a broad range 
of microbial communities. To make sense of the vast amount of 
sequences, novel tools had to be developed that could cope with 
the anonymous and fragmented nature of metagenomic data. 
In particular, to answer the classical 'who is there?' question, 
several specific problems have to be addressed. The short 
length of sequence fragments, the insufficient phylogenetic cover- 
age of current genome databases and the computational expense 
of the underlying algorithms are still limiting factors in taxo- 
nomic profiling of metagenomes today. Because we usually 
have no a priori knowledge about the composition of a meta- 
genome, a taxonomic profiling method, in principle, must be able 
to cover all three domains of life. To encompass the whole 
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spectrum of possible biological sources, besides bacteria, archaea 
and eukaryota, also viral entities have to be considered. A major 
problem of such a full-range taxonomic analysis arises from the 
limited coverage of genome databases that provide the required 
reference data for the characterization of novel sequences. In 
particular, archaea and viruses are generally not well-represented 
by current database genomes, making it difficult to obtain real- 
istic estimates of the corresponding abundances. 

The existing methods for taxonomic profiling can be divided 
into homology-based and model-based approaches. Among the 
homology-based approaches, most methods rely on a BLAST 
(Altschul et ai, 1990) similarity search of metagenomic sequences 
against a genomic reference database (e.g Huson et ai, 2007; 
Krause et al., 2008; Meyer et ai, 2008). In this case, the phylo- 
genetic classification of sequences is derived from the taxonomic 
labels of those reference sequences with a significant similarity. 
This, in principle, enables the measurement of the proportions of 
bacterial, archaeal, eukaryotic and viral sequences. To reduce the 
computational cost of the homology search step, the analysis can 
be restricted to certain signature (Segata et ai, 2012b) or marker 
(Liu et aL, 2011) genes. However, in the latter case, the estima- 
tion of viral abundances is complicated by the fact that there are 
no universal marker genes for viruses. 

Model-based approaches build a taxon-dependent, in most 
cases organism- specific, feature representation of reference gen- 
omes that is used for the analysis of metagenomic sequences. 
Currently, all methods rely on features derived from the oligo- 
nucleotide frequencies of the taxonomically labeled reference 
genomes (e.g. Brady and Salzberg, 2009; Rosen et aL, 2011), 
which are therefore often termed as composition-based methods. 
While in the case of homology-based methods the computational 
cost for taxonomic profiling increases with the number of refer- 
ence sequences, for model-based approaches, the cost just 
increases with the number of represented taxons. A good profil- 
ing accuracy of model-based methods has been reported for mi- 
crobial metagenomes that mainly comprise bacterial DNA (e.g. 
Diaz et ai, 2009; Meinicke et ai, 201 1; Parks et aL, 201 1; Rosen 
et aL, 2011). Recently, also the profiling of metagenomes with a 
substantial fraction of eukaryotic DNA has been addressed and 
promising results have been reported (Brady and Salzberg, 201 1). 
However, to our knowledge, there is no systematic study that 
investigates the ability of model-based approaches to measure 
the fraction of viral DNA. Conceptually, it is not clear whether 
unique viral oligonucleotide signatures exist that make them gen- 
erally distinguishable from microbial signatures. Besides a pos- 
sible host adaptation, there is no reason why viruses should 
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exhibit a typical codon usage or oligonucleotide profile. The only 
model-based tool that explicitly includes viral reference data is 
NBC (Rosen and Lim, 2012). However, as outlined by the au- 
thors, the NBC tool should not be used with mixed reference 
databases of, for example, viral and prokaryotic models because 
of a preference of NBC estimates for longer genomes. Thus, 
NBC virus models are not meant to supplement the microbial 
models for full-range taxonomic profiling and should therefore 
only be applied to pure virus data where no microbial contam- 
ination is anticipated. 

We here present the 'Taxy-Pro' method as the first model- 
based taxonomic profiling approach that uses protein signatures 
instead of the commonly used oligonucleotide signatures. The 
features of these protein signatures comprise the frequencies of 
protein domain families according to the Pfam database (Finn 
et aL, 2010). Because it does not make sense to estimate the 
protein signature of a single short sequence, we do not perform 
a classification of metagenomic reads by means of these protein 
features. Instead we make use of a mixture model for taxonomic 
profiling that has been introduced for oligonucleotide-based pro- 
filing of metagenomes (Meinicke et al., 2011). For mixture mod- 
eling of protein signatures, we aim to reconstruct the overall 
Pfam domain frequencies of a metagenome by a linear combin- 
ation of genomic reference signatures. Pfam domain hits have 
also been used as a basis for taxonomic profiling by the CARMA 
(Krause et al, 2008) and TreePhyler (Schreiber et al, 2010) tools. 
However, these tools do not analyze the overall domain frequen- 
cies, but perform a computationally expensive phylogenetic clas- 
sification of each sequence with a Pfam hit based on a multiple 
alignment with the Pfam reference sequences. 

In the evaluation of Taxy-Pro, we show that the protein-based 
mixture model provides important advantages when compared 
with the oligonucleotide-based model. We observed a higher 
accuracy of Taxy-Pro in the quantification of archaeal DNA, 
where the narrow spectrum of available database genomes re- 
quires a good generalization of the profiling method. In addition, 
we demonstrate that protein-based models can be built from viral 
metagenomes to obtain realistic estimates of the virus fraction. 
Our results indicate a strikingly higher sensitivity of the extended 
Taxy-Pro mixture model when compared with homology-based 
classification methods merely based on genomic reference data. 
Using the domain frequencies as obtained from the CoMet web 
server (Lingner et al, 201 1), computation is about three orders of 
magnitude faster than a speed- optimized BLAST analysis that 
includes viral metagenome sequences as reference data. 

Because of this computational efficiency, we were able to per- 
form a large-scale analysis of sequence data from the Human 
Microbiome Project (HMP, Peterson et al, 2009) without 
using a computer cluster or special hardware. In agreement 
with previous HMP studies (e.g. Huttenhower et al, 2012), we 
found that archaea are not among the abundant organisms in 
any of the samples. However, we observed a significant eukary- 
otic fraction in some samples, which has not been reported in the 
original studies. 

2 SYSTEM AND METHODS 

Our novel taxonomic profiling approach Taxy-Pro' is based on 
mixture modeling of the protein domain frequencies in a 



metagenome as described in the following. We argue that the 
use of functional reference profiles as mixture model components 
provides a powerful framework to cope with the inherent under- 
representation of certain biological entities in current genome 
databases. This problem is especially evident in the case of 
viruses, and, in fact, it seems not possible to accurately estimate 
the fraction of viral DNA in metagenomes merely based on gen- 
omic reference data. To overcome this limitation, we explored 
the possibility of including metagenomic reference data to im- 
prove the profiling accuracy. While the inclusion in Taxy-Pro is 
straightforward, a classical BLAST-based pipeline suffers from 
the computational burden of such an extension. We therefore 
realized a speed-optimized 'Combi-BLAST' method to provide 
a classical 'baseline' approach for comparison with Taxy-Pro (see 
Supplementary Material). 

2.1 Protein-based mixture modeling 

Our new method for taxonomic profiling combines the detection 
of protein domains with a mixture model reconstruction of 
the resulting domain frequencies by means of taxonomically 
labeled protein reference signatures. Thus, our Taxy-Pro 
approach comprises two steps: first, we estimate the overall pro- 
tein domain distribution of a metagenome by relative frequencies 
of Pfam hits using the CoMet domain detection engine. The 
resulting profiles comprise the relative frequencies of 12 621 pro- 
tein domain families according to release 24 of the Pfam data- 
base. Then we approximate the Pfam profile vector y of a 
metagenome by linear combination of the precalculated protein 
reference signatures x* with mixing weights w t according to 

M 

y = ^w/-x f . (1) 

i=\ 

For estimation of the mixture weights, we use the Expectation- 
Maximization (EM) algorithm (Dempster et al, 1977), whereby 
Y^iL\ Wi = 1 and Wi > 0. Summing up the mixture weights of all 
viral reference signatures, we finally obtain an estimate of the 
virus fraction in a given metagenome. The fraction of bacterial 
and archaeal DNA as well as the composition on more specific 
phylogenetic levels was estimated in the same way by summing 
up the weights of the corresponding reference signatures. 

The approximation error of the above mixture model, i.e. the 
divergence between the original Pfam profile vector y and its 
approximation y, we refer to as the fraction of domain hits un- 
explained (FDU). The FDU was calculated according to 

1 N 

fdu = 9Ei»-»i (2) 

7=1 

which is half the Manhattan distance between the two profile 
vectors and therefore ranges between 0 and 1. In terms of the 
used protein signatures, the FDU measures the fraction of Pfam 
domain hits that could not be reconstructed by the model, i.e. the 
fraction that is missing in the reconstructed profile vector y of 
predicted domain frequencies. Owing to symmetry, there is an 
equally sized fraction of domain hits that are overpredicted by 
the model, i.e. that are not present in the actual metagenome 
profile y. Together, these two fractions make up the complete 
Manhattan distance according to the overall L x approximation 
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error. This error is an important indicator of the reliability of 
the abundance estimates: the better the model fits the actual 
distribution of Pfam hits, i.e. the lower the FDU value, the 
more we can trust the taxonomic profile that directly results 
from the estimated mixture weights. Note that half the 
Manhattan distance can also be used to measure the divergence 
between two taxonomic profiles, whereby the y-values represent 
different relative abundances of phyla. In this context, the dis- 
tance measure corresponds to the Bray-Curtis dissimilarity, 
which is widely used in ecology for comparison of two assem- 
blages (Bray and Curtis, 1957). 

2.2 Reference signatures from viral metagenomes 

For the reference signature vector profiles x /? we not only used 
genomic Pfam profiles of several prokaryotes, eukaryotes and 
viruses, but we also used a collection of metagenomes for com- 
putation of reference vectors. Because the number of available 
phage genomes, which have a sufficient length for estimation of 
Pfam profiles, is rather small and variation of viral signatures 
is large, we use additional reference signatures obtained from 
a collection of viral metagenomes. In total, Taxy-Pro includes 
92 phage genomes and 102 viral metagenomes in addition to 
1730 bacterial, 122 archaeal and 50 eukaryotic genomes to 
obtain 2096 functional reference signatures. Because the estima- 
tion of Pfam profiles requires a sufficient amount of DNA 
sequence, we only use phage genomes with a length >100 kbp 
for computation of a reference signature. 

Depending on the particular treatment, the viral metagenome 
data in current databases contain a varying amount of microbial 
contamination. For reasons of accuracy the mixture approach 
should not be used with viral signatures that arise from sequence 
data with only a small virus fraction. Therefore, we implemented 
a selection criterion that requires the reference signatures from 
viral metagenomes to be well distinguishable from microbial 
metagenome signatures. Using the Pfam profiles of 260 metagen- 
omes from the CAMERA website (http://camera.calit2.net/, 
as of April 2010), we trained linear classifiers to discriminate 
between viral and microbial metagenomes. 

A step- wise elimination was applied to reduce the signatures: 
regularized least squares classifiers on Pfam domain frequencies 
(Lingner et al, 2010) were optimized by a 5-fold cross-validation 
to minimize the classification error. The optimization was per- 
formed 10 times, each time with a different random partition for 
the cross-validation. Therefore, each signature was used 10 times 
as a test example and the one with the highest misclassification 
rate over the 10 runs was eliminated. The whole procedure was 
repeated for the reduced set of signatures until no test error 
occurred during the 10 cross-validation runs. In that way the 
original number of 151 viral metagenomes was reduced to 102 
viral reference signatures that were finally included in the mixture 
model (see Supplementary Table S3). 

In contrast to the oligonucleotide signatures in the original 
Taxy approach, the Pfam signatures in Taxy-Pro depend on se- 
quence length. To make genomic and metagenomic Pfam profiles 
compatible, we fragmented the genomic sequences before per- 
forming the CoMet domain detection. For that purpose, we 
used half overlapping fragments with a 400 bp length. 



3 IMPLEMENTATION 

Taxy-Pro is implemented in the MATLAB programming lan- 
guage as an extension of the platform-independent and freely 
available Taxy toolbox for MATLAB/Octave (http://gobics.de/ 
Taxy Pro). In addition, Taxy-Pro has been included in the CoMet 
web server (http://comet.gobics.de/) to make the method access- 
ible via an easy-to-use interface. 

4 EXPERIMENTAL EVALUATION 

Taxonomic profiling of metagenomes involves two dimensions of 
resolution: the depth of the analysis indicates the specificity of 
taxonomic categories along the phylogenetic tree and increases 
from upper to lower levels. The breadth or range of the analysis 
indicates the completeness of taxonomic categories on a particu- 
lar level and increases with the number of branches that are 
actually included in the estimate. While it may be possible in 
particular cases, e.g. in metagenomic diagnostics, to measure 
the taxon abundances on phylogenetically rather specific levels 
down to genus or even species rank, this would be difficult to 
achieve for the general case. Without prior knowledge about 
the true metagenome structure, it is therefore important to first 
obtain a coarse but reliable high-level estimate that covers 
the full range of biological entities. Ideally, this estimate includes 
the fractions of all domains of life and viruses. In our experimen- 
tal evaluation, we want to point out the difficulties of such 
a full-range approach and we provide two comparative studies 
to show the large divergence of results from different tools even 
on the highest taxonomic level. Finally, we investigate the add- 
itional possibility of Taxy-Pro to characterize the uncertainty 
of a composition estimate by means of the model approximation 
error. 

4.1 Measuring archaeal DNA in metagenomes 

Microbial genome projects have introduced a strong bias in 
the databases toward certain kinds of terrestrial bacteria. As a 
consequence, archaeal organisms are currently underrepresented 
in genome databases, which complicates the detection of 
archaeal DNA in metagenomes because taxonomic profiling 
methods inherently depend on the available reference genomes. 
To investigate how this underrepresentation affects the compos- 
ition estimates, we compared method- specific predictions for 
archaea and bacteria using real metagenomic data and simulated 
sequencing reads. 

The metagenomic dataset we used for comparison of different 
profiling methods is derived from oil seep metagenome samples 
as described in (Havelsrud et aL, 201 1). For the Oil Seep data, a 
large proportion of archaeal DNA (see Supplementary Material) 
has been reported in the original study. 

To investigate whether this finding could be reproduced 
with a broad range of methods, we analyzed the results of 
11 different tools for taxonomic profiling (see Supplementary 
Material). For comparison of the composition estimates, we 
considered only assignments to the two superkingdom cate- 
gories 'Archaea' and 'Bacteria' and required the two fractions 
to yield unit sum. Figure 1 shows the results of our analysis, 
which reveals that the estimated fraction for archaea varies 
considerably across different methods. For instance, the 
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Fig. 1. Estimated fraction of archaeal and bacterial DNA for the Oil 
Seep dataset using different tools 



oligonucleotide-based NBC method attributed only 1.6% of 
the classified reads to this domain. On the other extreme, 
Taxy, MEGAN, Taxy-Pro, WebCARMA and MetaPhyler 
predicted a relatively large fraction of archaeal DNA (29.0, 
28.1, 26.3, 23.4, 21.7%, respectively). The remaining methods 
produced archaea fraction estimates of ~10%. This broad 
spectrum of estimates indicates that the different approaches 
are obviously affected in different ways by the 
above-mentioned database bias. We also applied MetaPhlAn 
(Segata et al, 2012b) to this dataset, but no valid taxonomic 
assignments could be found with that tool, possibly owing to 
the low coverage of signature gene hits and the used classifi- 
cation rule (see Supplementary Material). 

To further investigate the high degree of disagreement between 
different approaches, we conducted a simulation study where 
we used two particular archaeal genomes to generate sequence 
fragments for taxonomic profiling. For archaea in current data- 
bases, the phylogenetic coverage in terms of the number of 
fully sequenced organisms is rather low for most of the lower 
rank taxonomic categories. Therefore, it was possible to choose 
two genomes from rather isolated branches to simulate sequen- 
cing reads with a high degree of phylogenetic novelty. To obtain 
transparent accuracy measurements, we performed two inde- 
pendent tests on species-specific collections of sequences accord- 
ing to the two archaeal organisms. For each collection, 
we measured the profiling accuracy in terms of the estimated 
abundances of the true taxonomic categories the organisms 
belong to. In the ideal case, the estimated fraction for each 
of the annotated taxons is equal to 1. For a comparative evalu- 
ation, we selected five tools for which we could explicitly choose 
the used reference genomes to provide a strict separation be- 
tween training and test data. For the sequence classification 
methods (PhymmBL, MEGAN, SOrt-ITEMS), we only used 
reads with a valid taxonomic classification for abundance esti- 
mation. While Taxy estimates are based on the evaluation of 
all reads, Taxy-Pro only uses valid Pfam hits for profiling. 
We included SOrt-ITEMS (Monzoorul Haque et al, 2009) 
as another protein homology-based tool to investigate the 
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Fig. 2. Accuracy of different profiling methods on test datasets from two 
archaeal genomes using simulated 450 bp reads (see text) 



impact of a modified classification rule as compared with the 
original lowest common ancestor (LCA) rule of MEGAN 
(Huson et al, 2007). 

We used MetaSim (Richter et al, 2008) to simulate reads for 
the two archaea Acidilobus saccharovorans 345-15 (As) and 
Methanothermus fervidus DSM 2088 (Mf) (see Supplementary 
Material). Both organisms have been included in NCBI in 
2011 and are unique at the rank of order, i.e. there are no over- 
laps on order level (and below) with the used reference organ- 
isms. Also the Pfam release 24 used for CoMet domain detection 
did not include sequences from these species and correspondingly 
we excluded all associated hits from the BLAST results in our 
comparative study. To simulate different read lengths, we gener- 
ated two test sets with an average fragment length of 250 and 
450 bp, respectively. 

Figure 2 shows the profiling accuracy of different methods in 
terms of the abundances of the correct taxons on superkingdom, 
phylum and class level for the 450 bp sequence datasets. In case 
of MEGAN/SOrt-ITEMS, the percentage of reads with valid 
taxonomic assignments was 64.9/77.5% for the As data and 
80.8/90.5% for Mf sequences. For Taxy-Pro estimates, 45.3% 
and 65.5% of the reads were used from As and Mf data, respect- 
ively. With PhymmBL and Taxy, all reads were used for abun- 
dance estimation. On superkingdom level, Taxy-Pro and the two 
protein homology-based approaches showed good results with 
nearly 90% and above for both, As and Mf, genomes. In con- 
trast, the olignucleotide-based Taxy and PhymmBL tools broke 
down to ~42% and 34% for the Mf genome, indicating a sub- 
stantially lower generalization capability in that case. As ex- 
pected, the accuracy of all methods decreases for lower 
phylogenetic levels. Here, SOrt-ITEMS showed the largest dif- 
ference from superkingdom to class level at 450 bp read length 
for both genomes (48.9 and 51.2 percentage points [p.p.] differ- 
ence, respectively). Comparing the fragment length-induced dif- 
ferences for each method, we found the divergences to be small 
(<5p.p.) in general, with only a few exceptions. The maximum 
deviation we observed on class level for the SOrt-Items tool, 
which showed a discrepancy of 20.4/10.7 p.p. (As/Mf). The com- 
plete results for the simulated 250/450 bp reads are shown in 
Supplementary Tables S4 and S5. 
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4.2 Measuring viral DNA in metagenomes 

Considering the immense diversity and variation of viral gen- 
omes, viruses are notoriously underrepresented in genome data- 
bases today. This fact also complicates the evaluation of the 
ability of different taxonomic profiling approaches to predict 
the fraction of viral DNA in metagenomes. To circumvent the 
difficulties that arise from the sparse taxonomy and limited 
coverage of viral genomes in current databases, we avoided the 
use of simulated data in the evaluation of the profiling accuracy. 
Instead, we made use of the rich resources in viral metagenomics 
(e.g. Mokili et aL, 2012) and chose two viral metagenomes for 
evaluation of different profiling methods. The Freshwater and 
Rumen datasets (see Supplementary Material) have both been 
obtained using protocols for enrichment of viral DNA. The 
majority of sequences in these collections are of viral origin al- 
though some microbial contamination cannot be excluded. As 
indicated in Section 1 , not all of the before-mentioned tools can 
actually be used to measure viral DNA in metagenomes. Because 
of the evidence against oligonucleotide-based approaches, we re- 
stricted our comparative analysis to homology-based techniques 
that are widely used in viral metagenomics. 

For our analysis, we considered the method-specific propor- 
tion estimates on superkingdom level including the fractions of 
archaea, bacteria, eukaryota and viruses with abundances nor- 
malized to yield unit sum. When examining the taxonomic pro- 
file as predicted by the different methods, we observed a huge 
variation in the estimated virus fraction ranging between 0 and 
80.4% (Fig. 3). Remarkably, the WebCARMA tool did not clas- 
sify any reads as viral in any of the two test datasets. Applying 
the LCA classification rule to the BLASTN output (see 
Supplementary Material) resulted in virus fraction estimates of 
11.8% and 2.8% in Freshwater and Rumen, respectively. A simi- 
lar proportion was obtained by the MG-RAST pipeline, which 
attributed 4.2% and 7.3% to viral categories. Using MEGAN, 
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Fig. 3. Estimated fraction of viral DNA in the Freshwater (top) and 
Rumen (bottom) metagenomes using different profiling tools (see text) 



predicted proportions of 19.2% and 5.8% could be achieved. 
These were the highest values we obtained with a tool that is 
based on genomic reference sequences only. Using viral metagen- 
omes as an additional reference, our Combi-BLAST pipeline (see 
Supplementary Material) estimated 73.4% and 39.6% viral frac- 
tions for Freshwater and Rumen, respectively. The predicted rates 
were even higher for Taxy-Pro, which produced estimates of 
80.4% and 74.0% for these datasets. For both samples, some 
microbial contamination can actually be expected. In case of the 
more recent Rumen data, the authors identified an unspecified 
amount of bacterial contamination in the rumen virome sample 
using 16S rRNA amplification (Berg Miller et aL, 2012). This 
may explain the still significant fraction of bacteria in the 
Taxy-Pro estimate, which corresponds well with the original 
study. All methods agreed in predicting only a small fraction 
of archaea in the viral metagenomes (Fig. 3). Only the 
WebCARMA and BLASTN-LCA methods detected a relatively 
high fraction (~10%) of eukaryotic DNA in the Rumen dataset. 
The relatively low fraction of virus DNA predicted by 
Combi-BLAST for this dataset indicates a significantly higher 
sensitivity of the Taxy-Pro approach. An additional evaluation 
on viral metagenome data that has been obtained with the most 
recent protocols (Duhaime et aL, 2012) showed similiar results 
with an even higher virus fraction predicted by Taxy-Pro (see 
Supplementary Figure SI). 

4.3 Large-scale analysis of Human Microbiome data 

The sequence data produced by the HMP (Peterson et aL, 2009) 
is the first extensive collection of samples from diverse human 
body sites with a large number of subjects, allowing for a more 
precise insight into the composition of the healthy human micro- 
biome. With more than thousand samples, the HMP provides a 
wealth of data for large-scale comparative studies (e.g. Segata et 
aL, 2012a). As a major difference to studies that use particular 
marker genes for taxonomic profiling (e.g Huttenhower et aL, 
2012), we here analyze all available HMP datasets in terms of 
their taxonomic composition including all domains of life and 
viruses. In addition, we investigated the uncertainty of the com- 
position estimates by means of the Taxy-Pro model quality. 

Model quality For taxonomic profiling of metagenomes, it 
is important to characterize the uncertainty of the composition 
estimates. In a BLAST-based analysis, the proportion of se- 
quences without significant hits can be used as an indicator of 
uncertainty: the smaller the number of sequences that are actu- 
ally used to estimate the proportions, the higher the uncertainty 
of the corresponding estimates. In Taxy-Pro, we have a corres- 
ponding measure, namely the amount of sequences without 
Pfam domain hits, which we refer to as the fraction of sequences 
unexplained (FSU). Furthermore, with the Taxy-Pro mix- 
ture model, it is possible to get an even more comprehensive 
characterization of the uncertainty by taking into account the 
approximation error of the model in terms of the FDU (see 
Section 2). 

To get an overview of the uncertainty in the estimates for the 
HMP data, we assessed the Taxy-Pro model quality for all 
datasets in terms of the FSU and FDU values. Figure 4 shows 
a scatter plot of the two model quality indices for all 1494 
samples. Here, the markers associated with the datasets are of 
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different types according to whether they represent 'joined 
paired-end' or 'singleton' reads, respectively (see Supplementary 
Material). Ideally, for reliable composition estimates, a dataset- 
related marker should be located in the lower left corner of this 
plot, as a low FDU indicates a well-fitting model, while a low 
FSU signals a solid statistical basis according to a high fraction 
of sequences with significant Pfam domain hits. As the stretched 
shape of the marker distribution in Figure 4 suggests, the FSU 
values of the HMP datasets exhibit a broader variation than the 
FDU values. As expected, the data points associated with single- 
ton reads generally show a higher FSU because the effective 
length of sequences is shorter for the singletons than for the 
complete paired-end reads. Remarkably, the singleton read 
data show a considerably higher amount of outliers in the 
right-hand tail of the FDU distribution. This may be attributed 
to the fact that the singletons result from a quality check where 
one part of the paired-end reads did not satisfy the requirements. 
On average, this may also imply a slightly lower sequence quality 
of the remaining singletons, as suggested by the horizontal shift 
of the corresponding (+) cluster in the scatter plot. 

An FDU value close to 1 indicates a large approximation error 
of the Taxy-Pro model for a particular dataset and thus ques- 
tions the estimates resulting from the model. To investigate pos- 
sible reasons for an increased FDU, we exemplarily inspected 
two datasets associated with an extremely high FDU value 
(markers 'A' and 'B' in Fig. 4). In one case, the source of uncer- 
tainty was quickly identified: the data file associated with marker 
'A' contains <50 reads, which does not provide a statistical basis 
for a reliable composition estimate, irrespective of the utilized 
profiling method. In contrast, the dataset corresponding to 
marker 'B' comprised >60 000 reads. However, an additional 
BLASTX search against NCBI-nr in this case only showed 323 
sequences with significant similarity hits. With such a low cover- 
age of similarity hits also a homology-based profile estimation 
would be difficult to justify. 

On the other side, a low dataset- specific FDU value alone 
should not be considered a guaranty for a reliable composition 
estimate. For instance, the sequence data associated with 
marker 'C in Figure 4 exhibited an extremely low domain hit 
coverage (FSU: >0.95) paired with a low approximation error 



(FDU: <0.2). Most reads of this dataset include several unspeci- 
fied nucleotides within short sequence distance, which prevents 
the used gene prediction by CoMet from generating a sufficient 
amount of ORFs longer than the required minimum of 60 bp. 
In this case, the low fraction of domain hits as compared with 
the number of sequences suggests to cross-check the resulting 
composition estimates with a different profiling approach. 

Because of the overall higher FSU and FDU values for the 
singleton samples as compared with the joined paired-end reads, 
we excluded the singleton datasets from the subsequent taxo- 
nomic profiling analysis with Taxy-Pro. In particular, the 
many outliers among the singleton samples with rather high 
FDU values indicate a large quality variation not suitable to 
provide a consistent statistical basis. From the remaining data, 
we only excluded one paired-end file with an exceptional FDU of 
0.83 to achieve a final number of 749 datasets for the subsequent 
composition analysis. 

Composition estimates Taxonomic profiling of the 749 HMP 
samples revealed that, as expected, bacteria make up the most 
abundant fraction in all samples with values ranging between 
62.5 and 99.9%. On average, bacteria represented 97.9% 
(±4.1%) of the estimated abundances. In agreement with the 
original HMP studies (e.g. Huttenhower et aL, 2012), we found 
that archaea are not among the abundant organisms in any of 
the samples with estimated fractions <1%. The estimated virus 
fractions were mostly in the range of the prediction error 
(0.8 ±0.9%), in almost all cases <5% with only five samples 
slightly above and with one single sample showing 13.7%. In 
contrast to the low archaea and virus proportions, we observed 
a significant eukaryotic fraction in some samples, exceeding 10% 
in 24 samples with a maximum of 36.3% in one case (on average 
1.2 ±3.8%). This maximum eukaryotic fraction was estimated 
for a sample from anterior nares (skin, marker 'D' in Fig. 4), 
which we further analyzed with MEGAN. The analysis also 
resulted in a high predicted fraction of eukaryotic DNA 
(46.7%), whereby Malassezia globosa represented the most 
frequent Eukaryota species. While members of the genus 
Malassezia are lipophilic or lipid-dependent yeasts that are part 
of the normal cutaneous microflora, the excessive abundance 
of this pathogenic fungus is associated with prevalent skin 
disorders (Xu et aL, 2007). 

5 DISCUSSION 

We have shown that the functional profile of a metagenome in 
terms of the overall protein domain frequencies can be used to 
infer its taxonomic composition over the whole range of biolo- 
gical entities. In our evaluation, we demonstrated the difficulties 
of the existing tools in providing full range estimates that include 
the proportions of all domains of life and viruses. In particular, 
a reliable estimation of archaeal and viral fractions constitutes 
a major challenge for taxonomic profiling tools. 

Considering the coverage of current genome databases, we 
found that it is necessary to include reference data from viral 
metagenomes to obtain realistic estimates of the virus fraction. 
Our results indicate that the restriction to genomic reference data 
in current taxonomic profiling tools leads to systematic under- 
estimation of the virus fraction. For that reason, a large number 
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of taxonomically unassigned reads is sometimes taken as an in- 
direct indicator for a high virus fraction. In general, such kind of 
negative evidence is not valid and can be completely misleading 
in metagenomics. Taxy-Pro takes a different approach and per- 
forms a taxonomic analysis of the protein signature, which arises 
from all functionally assigned reads. The Pfam profile may in 
fact provide positive evidence for viral DNA according to a 
high similarity with viral reference signatures. Like no other 
approach, our Taxy-Pro mixture model allows the integration 
of a large amount of viral reference data from metagenomes in 
a highly efficient manner. Nevertheless, the limited number and 
the small range of environments covered by the current data can 
be expected to introduce a strong bias into the estimation of the 
viral DNA fraction. Here, our hope is that the rapidly increasing 
number of sequencing projects will also provide us with a larger 
number and a broader range of viral metagenomes in the 
sequence databases. 

Also, the quality of the data in terms of a possible contamin- 
ation is an important aspect. In fact, many of the viral datasets 
can be expected to contain a significant fraction of microbial 
DNA. To cope with the contamination problem, we have already 
shown a possible first solution in terms of a reference selection 
criterion, which is useful to detect and exclude datasets that 
cannot be well distinguished from microbial metagenomes on 
the basis of protein domain signatures. Despite the possible con- 
tamination, among all state-of-the-art methods, the Taxy-Pro ap- 
proach showed by far the most sensitive estimation of viral DNA 
abundances and therefore our method can significantly contribute 
to quality control in microbial and viral metagenomics. 

We are aware of the fact that our Taxy-Pro approach relies on 
the detection of protein domains and therefore is restricted to the 
coding part of metagenomic DNA. Furthermore, the model 
quality depends on a sufficient number of Pfam hits, which is 
difficult to achieve with small datasets or sequence data with 
large error rates. For instance, a high density of undetermined 
bases in sequencing reads can substantially decrease the Pfam 
detection rate (see Section 4.3). Using the domain detection of 
the CoMet web server, the computation of Taxy-Pro estimates is 
fast (see Supplementary Table SI) but susceptible to sequencing 
errors. Metagenomic gene prediction tools that can, in principle, 
compensate for sequencing errors (Rho et al, 2010) could im- 
prove the situation in this case. In addition, HMMER3 (Finn 
et al, 2011) may be used instead of the faster but less sensitive 
CoMet engine to further increase the number of Pfam domain 
hits. We suppose that in principle also approaches based on 
signature genes, like MetaPhlAn, could be used for abundance 
estimation across the full taxonomic range. However, the extrac- 
tion of clade-specific genes that provide a sufficient coverage 
might be challenging for viruses. In contrast, Taxy-Pro does 
not require the preselection of specific signature genes but instead 
considers all Pfam domain frequencies as part of an overall pro- 
tein signature. 

As a novel feature of our approach, Taxy-Pro provides two 
complementary indices to assess the model quality. The fraction 
of sequences without Pfam domain hits (FSU) and the FDU can 
both be used to characterize the uncertainty of a composition 
estimate. Our large-scale evaluation of datasets from the Human 
Microbiome Project (HMP) indicates that especially the FDU 
offers new possibilities for fault detection in metagenomics. 



Our evaluation on HMP data also shows that Taxy-Pro does 
not obviate the need for other sequence analysis tools. On the 
contrary, as a fast pre-screening tool Taxy-Pro should be used in 
combination with state-of-the-art sequence classification tools to 
extract a maximum of information from metagenomic data. 
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